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Abstract 

Precise  calculation  of  molecular  electronic  wavefunctions  by  methods  such  as  coupled-cluster  re¬ 
quires  the  computation  of  tensor  contractions,  the  cost  of  which  has  polynomial  computational  scaling 
with  respect  to  the  system  and  basis  set  sizes.  Each  contraction  may  be  executed  via  matrix  multipli¬ 
cation  on  a  properly  ordered  and  structured  tensor.  However,  data  transpositions  are  often  needed  to 
reorder  the  tensors  for  each  contraction.  Writing  and  optimizing  distributed-memory  kernels  for  each 
transposition  and  contraction  is  tedious  since  the  number  of  contractions  scales  combinatorially  with  the 
number  of  tensor  indices.  We  present  a  distributed-memory  numerical  library  (Cyclops  Tensor  Frame¬ 
work  (CTF))  that  automatically  manages  tensor  blocking  and  redistribution  to  perform  any  user-specified 
contractions.  CTF  serves  as  the  distributed  memory  contraction  engine  in  Aquarius,  a  new  program  de¬ 
signed  for  high-accuracy  and  massively-parallel  quantum  chemical  computations.  Aquarius  implements 
a  range  of  coupled-cluster  and  related  methods  such  as  CCSD  and  CCSDT  by  writing  the  equations  on 
top  of  a  C++  templated  domain-specific  language.  This  DSF  calls  CTF  directly  to  manage  the  data  and 
perform  the  contractions.  Our  CCSD  and  CCSDT  implementations  achieve  high  parallel  scalability  on 
the  BlueGene/Q  and  Cray  XC30  supercomputer  architectures  showing  that  accurate  electronic  structure 
calculations  can  be  effectively  carried  out  on  top  of  general  distributed  memory  tensor  primitives. 


1  Introduction 

Quantum  chemistry  is  the  field  of  science  focused  on  the  application  of  quantum  mechanics  to  the  study  of 
electrons  in  molecules.  Among  the  most  common  methods  of  quantum  chemistry  are  many-body  (QMB) 
methods,  which  attempt  to  explicitly  solve  the  Schrodinger  equation  using  a  variety  of  models.  The  ex¬ 
plicit  treatment  of  electrons  in  molecules  leads  to  a  steep  computational  cost,  which  is  nonetheless  often 
of  polynomial  complexity,  but  with  the  benefit  of  systematic  improvement  achieved  through  appropriate 
elaborations  of  the  models.  The  coupled-cluster  (CC)  family  of  methods  [54,  5,  11]  is  currently  the  most 
popular  QMB  method  in  chemistry  due  to  its  high  accuracy,  polynomial  time  and  space  complexity,  and 
systematic  improvability.  Coupled-cluster  excels  in  the  description  of  molecular  systems  due  to  its  ability  to 
accurately  describe  electron  correlation  -  the  dynamic  effect  of  each  electron  on  the  others.  In  simpler  (and 
hence  cheaper)  methods,  electron  correlation  is  either  neglected  in  favor  of  an  averaged  interaction  (as  in 
self-consistent  field  theory  [43,  39])  or  approximated  by  an  assumed  functional  form  as  in  DFT  [38],  while 
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correlation  is  treated  explicitly  in  CC  methods  for  pairs  of  electrons  (CCSD),  triples  (CCSDT),  and  so  on  in 
a  systematically  improvable  way.  Additionally  CC  is  rigorously  size-extensive  [5]  and  easily  extensible  to 
excited  states  [51],  derivatives  [45,  52],  and  properties  [33].  This  paper  focuses  on  the  fundamental  kernels 
of  coupled-cluster  calculations  -  tensor  contractions  -  and  documents  a  completely  new  set  of  parallel  al¬ 
gorithms  implemented  as  a  distributed-memory  tensor  contraction  framework,  Cyclops  Tensor  Framework 
(CTF)  1 .  CTF  has  enabled  coupled-cluster  with  excitations  of  up  to  three  electrons  (CCSDT)  to  scale  on 
state-of-the-art  architectures  while  achieving  a  high  degree  of  efficiency  in  computation,  communication 
and  storage. 

Any  tensor  contraction  can  be  performed  via  a  series  of  index  reorderings  and  matrix  multiplications. 
Parallel  matrix  multiplication  is  a  well-studied  problem  and  existing  algorithms  are  capable  of  achieving 
high  efficiency  for  the  problem  on  a  variety  of  scales/architectures.  In  the  field  of  dense  linear  algebra,  op¬ 
timized  numerical  libraries  have  achieved  success  and  popularity  by  exploiting  the  efficiency  of  primitives 
such  as  matrix  multiplication  to  provide  fast  routines  for  matrix  operations  specified  via  a  high-level  inter¬ 
face.  CTF  raises  the  level  of  abstraction  to  provide  contraction  routines  which  employ  library-optimized 
matrix  multiplication  calls  to  maintain  efficiency  and  portability.  In  addition  to  contractions,  CTF  provides 
optimized  distributed  data  transposes  which  can  reorder  tensor  indices  and  extract  sub-tensors.  This  ca¬ 
pability  allows  CTF  to  dynamically  make  data-decomposition  decisions,  which  maintain  load-balance  and 
communication  efficiency  throughout  the  execution  of  any  given  tensor  contraction.  The  high-dimensional 
blocking  used  in  CTF  permits  the  capability  to  exploit  tensor  symmetry  to  lower  computational  and/or  mem¬ 
ory  costs  of  contractions  whenever  possible. 

We  expose  the  generality  of  this  framework  in  the  Aquarius  quantum  chemistry  program  2  via  a  concise 
interface  that  closely  corresponds  to  Einstein  notation,  capable  of  performing  arbitrary  dimensional  contrac¬ 
tions  on  symmetric  tensors.  This  interface  is  a  domain  specific  language  well-suited  to  theoretical  chemists. 
To  demonstrate  correctness  and  performance  we  have  implemented  coupled-cluster  methods  with  single, 
double,  and  triple  excitations  (CCSDT)  using  this  infrastructure. 

The  contributions  of  this  paper  arc 

•  a  structured  communication  pattern  for  tensor  contractions 

•  a  cyclic  tensor  decomposition  for  symmetric  tensors 

•  an  automatic  topology-aware  mapping  framework  for  tensor  contractions 

•  a  load-balanced  blocking  scheme  for  symmetric  tensors 

•  optimized  redistribution  kernels  for  symmetric  tensors 

•  an  expressive  and  compact  tensor  domain  specific  language  (DSL) 

•  a  scalable  implementation  of  coupled-cluster  up  to  triple  excitations 

In  this  paper,  we  will  start  by  detailing  previous  and  related  work  in  Section  2  and  overviewing  coupled- 
cluster  theory  at  the  start  of  Section  3.  We  give  examples  of  some  of  the  bottleneck  contractions  in  coupled- 
cluster  singles  and  double  (CCSD)  and  singles,  doubles,  and  triples  (CCSDT),  along  with  a  preview  of  the 
computational  mechanisms  and  tensor  mappings  which  optimize  these  contractions.  After  presenting  the 
theory,  we  will  discuss  the  architecture  of  Cyclops  Tensor  Framework.  Section  4  discusses  how  tensors 
arc  decomposed  into  blocks  and  how  redistribution  kernels  migrate  data  between  blocked  layouts  as  well 

'Cyclops  Tensor  Framework  is  publicly  available  under  a  BSD  license:  https  :  //github .  com/solomonik/ctf 
2 A  pre-alpha  version  of  Aquarius  is  publicly  available  under  the  New  BSD  license:  https://github.com/ 
de vinamat thews/ aqua rius 
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as  allow  sparse  input/output.  We  will  show  how  CTF  dynamically  makes  mapping  decisions  and  performs 
contractions  in  Section  5.  In  Section  6,  we  evaluate  CTF  by  measuring  its  performance  for  CCSD  and 
CCSDT  on  Blue  Gene/Q  and  Cray  XC30  architectures,  which  show  good  weak  scaling  and  outperform 
other  codes  designed  for  similar  computations,  namely  NWChem  [8]. 

2  Previous  work 

In  this  section,  we  provide  an  overview  of  existing  applications  and  known  algorithms  for  distributed  mem¬ 
ory  CC  and  tensor  contractions.  We  also  discuss  parallel  numerical  linear  algebra  algorithms  which  will 
serve  as  motivation  and  building  blocks  for  the  design  of  Cyclops  Tensor  Framework. 

2.1  NWChem  and  TCE 

NWChem  [8]  is  a  computational  chemistry  software  package  developed  for  massively  parallel  systems. 
NWChem  includes  an  implementation  of  CC  that  uses  tensor  contractions,  which  arc  of  interest  in  our 
analysis.  We  will  detail  the  parallelization  scheme  used  inside  NWChem  and  use  it  as  a  basis  of  comparison 
for  the  Cyclops  Tensor  Framework  design. 

NWChem  uses  the  Tensor  Contraction  Engine  (TCE)  [20,  6,  16],  to  automatically  generate  sequences 
of  tensor  contractions  based  on  a  diagrammatic  representation  of  CC  schemes.  TCE  attempts  to  form  the 
most  efficient  sequence  of  contractions  while  minimizing  memory  usage  of  intermediates  (computed  tensors 
that  are  neither  inputs  nor  outputs).  We  note  that  TCE  or  a  similar  framework  can  function  with  any  dis¬ 
tributed  library  which  actually  executes  the  contractions.  Thus,  TCE  can  be  combined  with  Cyclops  Tensor 
Framework  since  they  are  largely  orthogonal  components.  However,  the  tuning  decisions  done  by  such  a 
contraction-generation  layer  should  be  coupled  with  performance  and  memory  usage  models  of  the  under¬ 
lying  contraction  framework.  In  addition,  one  of  the  present  authors  (D.A.M)  is  working  on  a  new,  more 
flexible  generator  for  CC  contractions  which  could  be  more  tightly  coupled  to  CTF 

To  parallelize  and  execute  each  individual  contraction,  NWChem  employs  the  Global  Arrays  (GA) 
framework  [34].  Global  Arrays  is  a  partitioned  global-address  space  model  (PGAS)  that  allows  processors 
to  access  data  (via  explicit  function  calls,  e.g..  Put,  Get  and  Accumulate)  that  may  be  laid  out  physically  on 
a  different  processor.  Data  movement  within  GA  is  performed  via  one-sided  communication,  thereby  avoid¬ 
ing  synchronization  among  communicating  nodes,  while  accessing  distributed  data  on-demand.  NWChem 
performs  different  block  tensor  sub-contractions  on  all  processors  using  GA  as  the  underlying  communi¬ 
cation  layer  to  satisfy  dependencies  and  obtain  the  correct  blocks.  NWChem  uses  dynamic  load  balancing 
among  the  processors  because  the  work  associated  with  block-sparse  representation  of  tensors  within  GA  is 
not  intrinsically  balanced.  Further,  since  GA  does  not  explicitly  manage  contractions  and  data  redistribution, 
the  communication  pattern  resulting  from  one-sided  accesses  is  often  irregular.  The  dynamic  load-balancer 
attempts  to  alleviate  this  problem,  but  assigns  work  without  regal'd  to  locality  or  network  topology.  Cyclops 
Tensor  Framework  eliminates  the  scalability  bottlenecks  of  load  imbalance  and  irregular  communication, 
by  using  a  regular  decomposition  which  employs  a  structured  communication  pattern  especially  well-suited 
for  torus  network  architectures. 

2.2  ACES  III  and  SIAL 

The  ACES  III  package  uses  the  SIAL  framework  [31,  14]  for  distributed  memory  tensor  contractions  in 
coupled-cluster  theory.  Like  the  NWChem  TCE,  SIAL  uses  tiling  to  extract  parallelism  from  each  tensor 
contraction.  However,  SIAL  has  a  different  runtime  approach  that  does  not  require  one-sided  communica¬ 
tion,  but  rather  uses  intermittent  polling  (between  tile  contractions)  to  respond  to  communication  requests, 
which  allows  SIAL  to  be  implemented  using  MPI  two-sided  communication. 
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2.3  MRCC 


MRCC  [25]  is  unique  in  its  ability  to  perform  arbitrary-order  calculations  for  a  variety  of  CC  and  related 
methods.  Parallelism  is  enabled  to  a  limited  extent  by  either  using  a  multi-threaded  BLAS  library  or  by 
parallel  MPI  features  of  the  program.  However,  the  scaling  performance  is  severely  limited  due  to  highly 
unordered  access  of  the  data  and  excessive  inter-node  communication.  MRCC  is  currently  the  only  tenable 
solution  for  performing  any  type  of  CC  calculation  which  takes  into  account  quadruple  and  higher  exci¬ 
tations.  MRCC  uses  a  string-based  approach  to  tensor  contractions  which  originated  in  the  development 
of  Full  Cl  codes  [27,  36].  In  this  method,  the  tensors  arc  stored  using  a  fully-packed  representation,  but 
must  be  partially  unpacked  in  order  for  tensor  contractions  to  be  performed.  The  indices  of  the  tensors  arc 
then  represented  by  index  “strings”  that  arc  pre-generated  and  then  looped  over  to  form  the  final  product. 
The  innermost  loop  contains  a  small  matrix-vector  multiply  operation  (the  dimensions  of  this  operation  arc 
necessarily  small,  and  become  relatively  smaller  with  increasing  level  of  excitation  as  this  loop  involves 
only  a  small  number  of  the  total  indices).  The  performance  of  MRCC  is  hindered  by  its  reliance  on  the 
matrix-vector  multiply  operation,  which  is  memory-bandwidth  bound.  Other  libraries,  including  NWChem 
and  our  approach,  achieve  better  cache  locality  by  leveraging  matrix  multiplication. 

2.4  QChem 

The  QChem  [47]  quantum  chemistry  package  employs  libtensor  [15],  a  general  tensor  contraction  library 
for  shared-memory  architectures.  The  libtensor  framework  exploits  spatial  and  permutational  symmetry 
of  tensors,  and  performs  tensor  blocking  to  achieve  parallelization.  However,  libtensor  does  not  yet  pro¬ 
vide  support  for  distributed  memory  tensor  contractions  and  redistributions,  as  done  by  CTF.  The  libtensor 
module  in  QChem  also  contains  a  tensor  contraction  DSL  somewhat  similar  to  our  own.  One  of  the  main 
differences  between  the  two  DSLs  is  that  in  libtensor,  the  index  symmetry  of  the  output  tensor  is  implicitly 
described  by  the  symmetry  of  the  inputs  and  any  explicit  antisymmetrization  operations  to  be  performed.  In 
our  DSL,  it  is  the  other  way  around  in  that  the  antisymmetrization  operators  arc  implicitly  specified  by  the 
symmetry  of  the  inputs  and  output.  We  feel  that  the  latter  approach  gives  a  more  convenient  and  compact 
representation  of  the  desired  operation,  as  well  as  a  more  robust  one  in  that  antisymmetrization  operators 
must  be  specified  for  each  contraction,  while  the  output  tensor  structure  must  be  specified  only  once.  In 
addition,  the  use  of  repeated  and  common  indices  in  the  tensor  index  labels  and  operator  overloading  in  our 
approach  is  more  general  and  flexible  than  providing  functions  for  each  type  of  common  tensor  operation  as 
in  libtensor. 

2.5  Other  tensor  contraction  libraries 

There  are  other  tensor  contraction  libraries,  which  are  not  paid  of  standard  chemistry  packages  previously 
mentioned  in  this  paper.  One  concurrently-developed  distributed  memory  tensor  library  effort  is  given  by 
Rajbahandri  et  al  [42],  and  shares  many  similarities  with  our  work.  This  library  employs  similar  matrix 
multiplication  primitives  (SUMMA  and  3D  algorithms)  for  distributed  tensor  contractions  and  mapping  of 
data  onto  torus  networks.  A  few  of  the  important  differences  between  our  work  and  this  framework  arc 
overdecomposition  and  the  redistribution  mechanisms.  Recent  studies  have  also  demonstrated  the  efficacy 
of  scheduling  many  different  contractions  simultaneously  within  coupled-cluster  [29],  an  approach  that  is 
particularly  useful  for  higher  order  coupled-cluster  methods.  Our  paper  focuses  on  parallel  algorithms  for 
the  execution  of  a  single  tensor  contraction,  leaving  it  for  future  work  to  integrate  this  single-contraction 
parallelization  with  a  second  layer  of  multi-contraction  parallelism. 

There  have  also  been  efforts  for  efficient  implementation  of  tensor  contractions  for  coupled-cluster 
which  do  not  focus  on  distributed  memory  parallelism.  Hanrath  and  Engels-Putzka  [18]  give  a  sequen¬ 
tial  framework  which  performs  tensor  transpositions  to  efficiently  exploit  threaded  matrix  multiplication 
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primitives  within  tensor  contractions.  Parkhill  and  Head-Gordon  [37]  as  well  as  Kats  and  Manby  [26]  give 
sequential  implementations  of  sparse  tensor  contraction  libraries,  targeted  at  coupled-cluster  methods  which 
exploit  spatial  locality  to  evaluate  a  sparse  set  of  interactions.  Support  for  parallel  sparse  tensor  contractions 
is  not  within  the  scope  of  this  paper,  but  is  being  pursued  as  an  important  direction  of  future  work  since 
many  of  the  parallel  algorithmic  ideas  discussed  in  this  paper  extend  to  sparse  tensors. 

2.6  Matrix  multiplication  algorithms 

Since  tensor  contractions  arc  closely  related  to  matrix  multiplication  (MM),  it  is  of  interest  to  consider  the 
best  known  distributed  algorithms  for  MM.  Ideally,  the  performance  achieved  by  any  given  tensor  contrac¬ 
tion  should  approach  the  efficiency  of  matrix  multiplication,  and  generally  the  latter  is  an  upper  bound. 
Given  a  matrix  multiplication  or  contraction  that  requires  F/p  multiplications  on  p  processors,  with  M 
words  of  memory  on  each  processor,  it  is  known  that  some  processor  must  communicate  at  least 

"  =n(j7i)  (1) 

words  of  data  [22,  4,  21].  If  the  matrices  arc  square  and  of  size  S,  and  there  is  enough  memory  for  only  0(1) 
copies  of  the  matrices  (. S  =  0(M  •  p )),  matrix  multiplication  can  be  performed  on  a  2D  grid  of  processors 
in  a  communication-optimal  fashion.  In  particular,  blocked  Cannon’s  algorithm  [9]  and  SUMMA  [1,  53] 
achieve  the  bandwidth  cost 

W2D  =  0  (jfs) ' 

which  matches  Equation  1  for  S  =  0(47  •  p).  Given  an  unlimited  amount  of  memory,  it  is  possible 
to  replicate  the  data  to  reduce  the  communication  cost,  which  is  realized  by  a  3D  matrix  multiplication 
algorithm  [32,  12,  1,  2,  23]  with  a  resulting  bandwidth  cost  of 

W3D  =  O  f  — - - -j=  J  . 

Vp2/3  •  Vs) 

So,  3D  algorithms  lower  the  communication  cost  over  2D  algorithms  by  a  factor  of  p 1  ;/<’  and  are  in  fact 
optimal  independent  of  the  amount  of  available  memory  [2,  3]  (provided  there  is  enough  memory  to  perform 
the  algorithm). 

In  practice,  most  applications  run  with  some  bounded  amount  of  extra  available  memory.  2.5D  algo¬ 
rithms  minimize  communication  cost  for  any  amount  of  physical  memory  [50] .  In  particular,  if  all  operand 
tensors  or  matrices  arc  of  size  S  =  0(M  •  p/c),  where  c  £  [  1 , yj 1 ' 3 ] ,  the  communication  cost  achieved  by 
2.5D  matrix  multiplication  is 

W25D  =  0  (vF^)  ’ 

which  is  a  factor  of  y/c  smaller  than  the  2D  algorithm  (when  c  =  jV:‘,  the  2.5D  algorithm  reduces  to  a  3D 
algorithm).  The  additional  tunable  parallelization  dimension  provides  an  important  additional  parameter  for 
mapping  onto  torus  network  architectures  as  demonstrated  in  Ref.  [49]. 

Additional  algorithmic  considerations  arise  when  we  consider  matrix  multiplication  of  non-square  ma¬ 
trices  [44,  13].  In  particular,  it  is  possible  to  move  any  two  of  the  three  matrices  in  2D  algorithms,  and 
to  replicate  only  one  of  three  in  3D  algorithms.  To  reduce  communication,  it  is  best  to  move  the  smaller 
matrices,  as  the  size  of  the  matrices  differs  when  they  arc  non-square.  The  case  of  rectangular  matrices 
is  particularly  relevant  to  tensor  contractions  in  coupled-cluster  calculations,  which  arc  performed  on  ten¬ 
sors  with  different  edge-lengths  and  dimension,  and  therefore  of  varying  size.  Cyclops  Tensor  Framework 
attempts  to  avoid  communication  by  exploiting  replication  for  tensors  of  any  given  size  by  dynamically 
selecting  a  parallel  decomposition  with  the  least  estimated  communication  cost. 
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3  Coupled-cluster  Methods 


Coupled-cluster  [54,  5,  11]  is  an  iterative  process,  where  in  each  iteration,  the  new  set  of  amplitudes  T7  is 
computed  from  the  amplitudes  from  the  previous  iteration  T  and  from  the  (fixed)  Hamiltonian  H  =  F  +  V. 
The  diagonal  elements  of  the  one-particle  Hamiltonian  F  arc  separated  out  as  a  factor  —  D,  giving  a  final 
schematic  form  similar  to  a  standard  Jacobi  iteration  (although  V  still  contains  diagonal  elements), 


T7  =  D 


-i 


(F7  +  V)(l  +  T  +  -T2  +  -T3  +  —  T4) 
k  A  2  6  24 


The  expansion  of  the  exponential  operator  is  complete  at  fourth  order  due  to  the  fact  that  the  Hamiltonian 
includes  only  one-  and  two- particle  (two-  and  four-index)  parts  and  must  share  an  index  with  each  T  operator 
(this  is  the  meaning  of  the  c  subscript).  F7,  V,  and  T  arc  defined  in  terms  of  fundamental  tensors  by 


F7 

—  (1  —  dab)  fb  +  ft  +  /a  +  (1  —  %)/]> 

V 

=  v$  +  v$  +  v&  +  v$  +  v$  + 

Vab  +  Vt+Vak+VW 

T 

=  Tx  +  T2  +  •  •  •  +  Tn 

(2) 

where  the  abode  f  . . .  indices  refer  to  virtual  (unoccupied)  orbitals  while  ijklmn . . .  refer  to  occupied  or¬ 
bitals.  The  amplitude  (T)  and  integral  (F7  and  V)  tensors  arc  anti- symmetric  with  respect  to  permutation  of 
indices  which  arc  both  virtual  or  occupied  and  in  the  same  (up  or  down)  position,  e.g.  we  have 


vab  =  - vba 
IJ  *7 


=  vfi  =  -V 


ab 
ji  * 


The  contractions  which  must  be  done  can  be  derived  using  either  algebraic  or  diagrammatic  techniques  [5], 
however  the  result  (denoted  as  Z,  for  coupled-cluster  this  is  the  residual  in  an  iterative  Jacobi  procedure  for 
the  solution  of  T)  is  a  sum  of  contractions  exemplified  by 

va  —  I  i,am+ef 

zi  9  /  y  vef  ^ im 
efm 


ab  _  \  ST  „,mn.ef.ab 

o  4  Uef  imn' 

efmn 

where  the  summation  index  /  is  not  to  be  confused  with  the  one-electron  integral  tensor.  Contractions  which 
involve  multiple  T  tensors  arc  factored  into  a  sequence  of  contractions  involving  one  or  more  intermediates, 
such  that  each  contraction  is  a  binary  tensor  operation. 

The  equations,  as  written  above,  arc  termed  the  “spin-orbital”  representation  in  that  the  indices  arc 
allowed  to  run  over  orbitals  of  either  a  or  j3  spin,  while  only  amplitudes  with  certain  combinations  of  spin 
are  technically  allowed.  Some  programs  use  this  representation  directly,  checking  each  amplitude  or  block 
of  amplitudes  individually  to  determine  if  it  is  allowed  (and  hence  should  be  stored  and  operated  upon). 
However,  an  alternative  approach  is  the  use  of  the  spin-integrated  equations  where  each  index  is  explicitly 
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spin- a:,  abij  . . . ,  or  spin-/?,  abij  ....  For  example,  the  second  contraction  above  becomes, 


ab 

Aj 

_  1  \  mnj.ef.ab 

4  Z_/  Uef  tmn’ 

efmn 

ab 

Zi~j 

_  ST  Q,mnjef.ab 
~  Z^  ef  bij  bmni 

efmn 

ab 

Zr3 

_  1  j.rnnjef.ab 

~  \  /  j  uef  Lij  bmni 

efmn 


While  the  number  of  contractions  is  increased,  the  total  amount  of  data  which  must  be  stored  and  contracted 
is  reduced  compared  to  a  naive  implementation  of  the  spin-orbital  method,  and  without  the  overhead  of 
explicit  spin-checking. 

The  spin-integrated  Hamiltonian  elements  (/  and  v  tensors)  and  amplitudes  (t  tensors)  also  have  implicit 
permutational  symmetry.  Indices  which  appeal-  together  (meaning  either  both  upper  or  lower  indices  of  a 
tensor)  and  which  have  the  same  spin  and  occupancy  may  be  interchanged  to  produce  an  overall  minus  sign. 
In  practice  this  allows  the  amplitudes  to  be  stored  using  the  symmetric  packing  facilities  built  into  CTF. 

3.1  Domain  Specific  Language  for  Tensor  Operations 

Aquarius  provides  an  intuitive  domain  specific  language  for  performing  tensor  contractions  and  other  tensor 
operations.  This  interface  is  implemented  using  operator  overloading  and  templating  in  C++,  with  the  end 
result  that  tensor  contractions  can  be  programmed  in  the  exact  same  syntax  as  they  are  defined  algebraically, 

~ab  _  ST'  mn+ef.ab 
ij  ef  bij  Lrnn 

efmn 

W[“MnIj”]  =  V[“MnEf ”]  *T[“EfIj”]; 

Z[“AbIj”]  =  W[“MnIj”]  *  T[“AbMn”]; 

This  interface  naturally  supports  all  types  of  tensor  operations,  not  just  contraction.  The  number  and  place¬ 
ment  of  the  unique  indices  implicitly  define  the  operation  or  operations  which  are  to  be  performed.  For 
example,  the  repetition  of  an  index  within  an  input  tensor  which  does  not  appear  in  the  output  tensor  defines 
a  trace  over  that  index.  Similarly,  an  index  which  appeal's  in  all  three  tensors  defines  a  type  of  “weighting” 
operation  while  an  index  which  appears  multiple  times  in  the  input  and  once  in  the  output  will  operate  on 
diagonal  or  semi-diagonal  elements  of  the  input  only.  The  weighting  operation  deserves  special  attention  as 
it  is  required  in  CC  to  produce  the  new  amplitudes  T'  from  Z  =  HeT, 

T7  =  D_1Z 

4 


Dinv[“AbI  j”]  =  l/D[“AbIj”]; 

T[“AbIj”]  =  Dinv[“AbIj”]  *  Z[“AbIj”]; 
or  simply, 

T[“AbIj”]  =  Z[“AbIj”]/D[“AbIj”]; 
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Additionally,  Equation-of-Motion  CC  (EOM-CC)  and  many  other  related  techniques  have  terms  that  require 
computation  of  only  the  diagonal  elements  of  a  tensor  contraction  or  require  replication  of  the  result  along 
one  or  more  dimensions,  both  of  which  can  be  expressed  easily  and  succinctly  in  this  interface.  For  example, 
the  diagonal  tensor  elements  used  in  EOMIP-CCSD  include  terms  such  as, 

//"'•'  <  Wlf  and 

aij  i] 

Hatj.  <-  YvM, 

aij  /  ae  ’ 

e 

which  can  be  expressed  in  Aquarius  as, 

Hbar[“AIj”]  +=  W[“ljlj”]; 

Hbar[“AIj”]  +=  V[“IjAe”]  *  T[“AeI j”]; 

3.2  Application  to  CCSD  and  CCSDT 

The  CCSD  model  [41],  where  T  =  Ti  +  T2  (i.e.  n  =  2  in  Equation  2),  is  one  of  the  most  widely  used 
coupled-cluster  methods  as  it  provides  a  good  compromise  between  efficiency  and  accuracy,  and  is  straight¬ 
forward  to  derive  and  implement.  In  particular,  CCSD  is  only  slightly  more  computationally  expensive 
than  the  simpler  CCD  method  [46]  but  provides  greater  accuracy,  especially  for  molecular  properties  such 
as  molecular  structure  and  those  derived  from  response  theory.  Extending  this  to  CCSDT  [30,  35],  where 
T  =  Ti  +  T2  +  T3  (n  =  3),  gives  an  even  more  accurate  method  (often  capable  of  what  is  commonly 
called  “chemical  accuracy”),  at  the  expense  of  a  steeper  scaling  with  system  size  and  vastly  more  compli¬ 
cated  equations.  Formally,  CCSD  and  CCSDT  have  leading-order  costs  of  0{n20n „)  and  where 

nQ  and  nv  are  the  number  of  occupied  and  virtual  orbitals,  respectively. 

The  most  expensive  contractions  in  each  method  are  those  which  involve  the  highest  order  T  tensor  (T2 
for  CCSD,  T3  for  CCSDT),  multiplied  by  either  the  integrals  V  or  (equivalently  in  terms  of  computational 
cost)  by  a  4-dimensional  intermediate  tensor.  For  example,  in  CCSD  the  most  expensive  set  of  contractions 
is  the  “particle -particle  ladder”  term. 


z?b  =  - 


y 


ab 

Zfj 


~ab 

zTj 


It 


„  abfef 
2  Z—/  Vejij  ’ 
ef 


ef 

1 


E 


V-brt~. 
2  Z—j  ef  1 J 

ef 


When  the  spins  of  ab  and  ij  are  the  same  (the  first  and  third  contractions),  the  antisymmetry  of  these 
index  pairs  and  of  the  ef  index  pair  is  conserved  in  all  of  the  tensors.  This  is  taken  advantage  of  in  CTF 
by  folding  each  of  these  index  pairs  into  a  single  index  of  length  n(n  —  l)/2.  Additionally,  in  the  second 
contraction  the  ab,  ij,  and  ef  pairs  can  similarly  be  combined  after  the  tensor  has  been  properly  remapped, 
if  doing  so  will  reduce  the  communication  volume.  Folding  will  be  discussed  in-depth  in  the  next  section. 
Then,  the  contractions  reduce  to  (relatively)  simple  matrix  multiplication.  However,  the  situation  is  more 
complicated  for  the  second-most  expensive  set  of  contractions  (the  “ring”  term,  designated  as  such  by  the 


appearance  of  the  associated  diagrammatic  representation), 


*,ab  _  pi  pa  I  rn.bj.ae  ,  bm.j.ae 

* ij  1  j1  b  ]  /  y  wej  lim  '  /  j  wje  Lim 

\  era  era 

4  =  E<e+E”0i«+E”’SsCj+ 

era  era  era 

E  + E  +  E  <n4V 

era  era  era 

4  =  ^{e^s+E^s 

where  is  an  intermediate  and  the  permutation  operator  ]  =  [. . .  a  ...  b b ...  a ...  ] 

Two  complications  arise  here:  first,  the  symmetric  index  groups,  such  as  ab,  ij,  etc.  in  the  output,  or  ae, 
fhj,  etc.  in  the  input  (none  of  the  index  groups  on  w  is  antisymmetric  since  the  indices  are  of  different 
occupancies)  are  “broken”,  as  they  do  not  always  appear  together.  Second,  the  first  and  third  equations 
require  explicit  antisymmetrization  through  the  Pf  operators.  Broken  symmetries,  discussed  more  in-depth 
in  the  next  section,  prevent  folding  of  the  index  pair  and  are  handled  in  CTF  either  by  unpacking  the  bro¬ 
ken  symmetries  or  by  performing  a  sequence  of  triangular  contractions  on  the  packed  data.  Similarly  the 
antisymmetrization  from  P'j  for  example  can  be  incorporated  into  the  handling  of  the  ab  symmetry  either 
by  explicitly  antisymmetrizing  while  repacking  the  data  if  it  has  been  unpacked,  or  by  altering  the  sign 
appropriately  for  individual  packed  contractions. 

These  same  types  of  contractions  are  again  the  most  expensive  ones  for  CCSDT,  except  that  now  the  T3 
tensor  is  involved.  For  the  particle-particle  ladder  term. 


abc  __ 
^ ijk 

pab  1 
c  2 

E«S' 

f.efc 
Jijk  ’ 

ef 

abc. 

* ijk 

=  XE 
2  ^ 

„,ab+efc 

Veftijk 

+  Pba 

E 

ef 

,aef 
ijk  ' 

ef 

ef 

abc 

tjk 

-  -E 

2  ^ 

ef  ijk 

+  pI 

E 

va\ 

ef 

j.efc 
ijk  ’ 

ef 

ef 

abc  _ 
ZTjk  ~ 

pab  1 
5  2 

E  # 

4 If 
ijk  ’ 

ef 


where  Pjb  generalizes  Pjj  by  interchanging  (antisymmetrizing)  both  a  and  b  with  c.  Similarly  to  the 
case  in  CCSD,  the  contractions  with  sums  over  ef  with  the  same  occupancy  represent  preserved  symmetries 
in  ef  and,  where  ef  and  c  are  of  opposite  occupancies,  in  ab  as  well.  These  index  pairs  can  then  be  folded. 
However,  in  the  first  and  fourth  equations,  the  three-index  symmetry  over  abc  is  broken,  and  explicit  anti¬ 
symmetrization  is  required.  In  the  case  of  a  three-index  symmetry,  though,  the  unpacking  can  be  partial,  in 
that  the  preserved  symmetry  of  ab  can  be  retained  and  these  indices  folded.  Multiple  triangular  contractions 
(or  more  precisely,  the  higher-dimensional  analogue)  can  also  be  performed  on  the  fully-packed  data. 

In  the  Aquarius  tensor  DSL,  optimizations  due  to  preserved  symmetries,  handling  of  broken  symmetries, 
and  explicit  permutation  operators  are  implied  by  the  symmetry  relations  in  the  specification  of  the  input  and 
output  tensors.  This  results  in  significant  reduction  and  simplification  of  the  required  code.  For  example,  if 
two  antisymmetric  matrices  are  multiplied  together  and  the  result  is  then  antisymmetrized  (which  is  signaled 
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by  specifying  that  the  output  matrix  is  itself  antisymmetric),  the  result,  in  terms  of  fully  packed  storage 
requires  six  separate  operations  to  handle  the  broken  symmetry  and  explicit  antisymmetrization.  All  of 
these  operations  can  be  represented  by  a  single  contraction  call  in  the  Aquarius  interface  if  the  output  tensor 
is  specified  to  have  antisymmetry, 

C[ab\  =  P£A[ac]x  B[cb], 

C[a  <  b\  =  { ^4[a,  <  c]B[c  <  b]  —  A[a  <  c]B[b  <  c] 

C 

— A[c  <  a\B[c  <  b]  —  A[b  <  c\B[c  <  a] 

+A[b  <  c]B[a  <  c]  +  A[c  <  b\B[c  <  a]} 

/*  A,  B,  and  C  are  antisymmetric  */ 

C[“ab”]  =  A[“ac”]  *B[“cb”]; 

An  interface  layer  to  automatically  produce  the  necessary  spin-integrated  contractions  has  also  been 
implemented,  so  that  the  code  can  be  written  entirely  in  terms  of  the  simple  spin-orbital  quantities.  This 
means  that,  for  example,  each  of  the  contraction  sets  above  is  actually  written  as  a  single  and  fairly  simple 
contraction.  With  these  simplifications,  the  total  amount  of  code  to  perform  a  single  CCSD  iteration  is  only 
41  lines,  and  to  perform  a  CCSDT  iteration  is  only  62  lines,  whereas  traditional  implementations  generally 
require  thousands  of  lines  of  hand-written  code  and  automatically-generated  programs  generally  consist  of 
many  thousands  of  lines  (for  example  the  TCE-based  CCSDT  module  in  NWChem  is  almost  12,000  lines). 

3.3  Higher-order  coupled-cluster 

Higher  order  CC  methods  (CCSDTQ  [28],  CCSDTQP,  etc.)  are  theoretically  very  similar  to  CCSD  and 
CCSDT,  however,  several  important  computational  distinctions  arise.  First,  as  the  order  increases,  the  high¬ 
est  set  of  Tn  amplitudes  becomes  vastly  larger  than  the  Hamiltonian  elements  and  the  other  T  amplitudes. 
The  computation  time  in  terms  of  FLOPS  is  dominated  by  just  a  few  contractions  involving  this  largest 
amplitude  set.  However,  the  sheer  number  of  small  contractions  which  must  be  done  in  addition  can  still 
dominate  the  wall  time  if  they  are  not  performed  as  efficiently  or  do  not  parallelize  as  well.  Thus,  the  effi¬ 
ciency  of  small  tensor  contractions  and  strong-scalability  of  the  parallel  algorithm  become  relatively  much 
more  important  for  higher  order  CC,  than  for  the  simpler  models  such  as  CCSD. 

Second,  since  the  total  memory  and/or  disk  space  available  for  the  computation  is  effectively  constant, 
high  orders  of  CC  effectively  demand  the  use  of  a  smaller  number  of  occupied  and  virtual  orbitals.  This 
shrinks  the  length  of  each  tensor  dimension,  threatening  vectorization  and  increasing  indexing  overhead. 
While  extremely  short  edge  lengths  could  still  cause  excessive  overhead  in  this  scheme  (due  to  the  padding 
needed  to  fill  out  the  fixed-length  tiles),  good  performance  should  be  retained  in  most  circumstances.  In 
addition,  the  ratio  of  floating  point  operations  to  tensor  size  (and  hence  memory  operations)  necessarily 
shrinks  for  higher  orders  of  CC  for  the  same  reasons,  so  that  choosing  distributions  which  minimize  data 
movement  of  the  largest  tensors  is  critical.  CTF  accomplishes  this  by  searching  over  the  possible  mappings 
and  automatically  choosing  the  optimal  one  in  terms  of  data  movement  and  memory  cost. 

4  Algorithms  for  tensor  blocking  and  redistribution 

CTF  decomposes  tensors  into  blocks  which  are  cyclic  sub-portions  of  the  global  tensor  and  assigns  them 
to  processors  in  a  regular  fashion.  The  partitioning  is  contraction-specific  and  tensor  data  is  transferred 
between  different  distributions  when  necessary.  Further,  it  is  necessary  to  allow  the  user  to  modify,  enter. 
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and  read  the  tensor  data  in  a  general  fashion.  In  this  section,  we  explain  how  the  cyclic  blocking  scheme 
works  and  give  data  redistribution  algorithms  which  are  designed  to  shuffle  the  data  around  efficiently. 

4.1  Cyclic  tensor  blocking 


Blocked  layout 


Block-cyclic  layout 
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Figure  1 :  The  load-imbalance  incurred  or  padding  necessary  for  blocked,  block-cyclic,  and  cyclic  layouts. 

A  blocked  distribution  implies  each  processor  owns  a  contiguous  piece  of  the  original  tensor.  In  a 
cyclic  distribution,  a  cyclic  phase  defines  the  periodicity  of  the  set  of  indices  whose  elements  are  owned 
by  a  single  processor.  For  example,  if  a  vector  is  distributed  cyclically  among  4  processors,  each  proces¬ 
sor  owns  every  fourth  element  of  the  vector.  For  a  tensor  of  dimension  d,  we  can  define  a  set  of  cyclic 
phases  (pi,P2,  ■  ■  ■  ,Pd),  such  that  processor  Pilti2,...,id  owns  all  tensor  elements  whose  index  ■  ■  ■ ,  .id) 

satisfies 

jk  =  ik  mod  pk 

for  all  k  G  { 1 , . . . ,  d}  and  where  pk  gives  the  length  of  the  processor  grid  in  the  k  —  th  dimension.  A  block- 
cyclic  distribution  generalizes  blocked  and  cyclic  distributions,  by  distributing  contiguous  blocks  of  any 
size  b  cyclically  among  processors.  Cyclic  decompositions  are  commonly  used  in  parallel  numerical  linear 
algebra  algorithms  and  frameworks  such  as  ScaLAPACK  (block-cyclic)  [7]  and  Elemental  (cyclic)  [40]. 
Our  method  extends  this  decomposition  to  tensors. 

Cyclops  Tensor  Framework  employs  a  cyclic  distribution  in  order  to  preserve  packed  symmetric  full 
structure  in  sub-tensors,  minimize  padding,  and  generate  a  completely  regular  decomposition,  susceptible  to 
classical  linear  algebra  optimizations.  Each  processor  owns  a  cyclic  sub-tensor,  where  the  choice  of  cyclic 
phases  in  each  dimension  has  the  same  phase  for  all  symmetric  indices.  By  maintaining  the  same  cyclic 
phase  in  each  dimension,  the  algorithm  ensures  that  each  of  the  sub-tensors  owned  by  any  processor  has  the 
same  fill  structure  as  the  whole  tensor.  For  instance,  CTF  might  decompose  the  integral  tensor  V  which  is 
anti-symmetric  in  two  index  pairs,  into  36  blocks  VWlW2W3W/l,  where  wi,W2  G  {0, 1}  and  m3,  uq  G  {0, 1,  2} 
so  that 


vfj  G  VW1W2W 3W4  ■  G  V,  a  =  w\  mod  2,  b  =  w 2  mod  2,  a  <  b 

i  =  W3  mod  3,  j  =  u?4  mod  3,  i  <  j. 

Each  sub-tensor  VWlW2W3w4  has  the  same  structure  as  the  unique  part  of  V  (vfj  for  a  <  b,i  <  j ),  though 
some  blocks  have  extra  elements  (e.g.  V0101  is  larger  than  Vnoi  since  vfj  =  0).  However,  it  is  important 
to  clarify  that  while  structure  is  preserved,  cyclicity  does  not  preserve  symmetry.  If  we  drop  the  conditions 
a  <  b  and  i  <  j  from  the  definition  of  the  ~VW1W2W3W4  block,  only  the  blocks  VWiwiw3w3  will  have  the  same 
symmetry  as  V  (the  same  would  be  true  if  the  blocks  were  selected  contiguously  rather  than  cyclically). 
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The  cyclic  blocks  differ  in  fill,  but  padding  along  the  diagonals  of  symmetric  indices  allows  all  the  blocks 
to  be  defined  in  the  same  shape.  Figure  1  demonstrates  the  padding  required  to  store  a  lower-triangular 
matrix  in  a  4-by-4  cyclic  layout.  For  comparison,  we  also  show  the  amount  of  padding  one  would  need  to 
preserve  identical  block  structure  in  block-cyclic  and  contiguously-blocked  layouts.  The  cyclic  layout  is  the 
only  one  able  to  preserve  the  fill  structure  and  does  not  require  a  significant  amount  of  padding.  Preserving 
a  regular  layout,  regardless  of  symmetry,  also  necessitates  padding  to  make  the  tensor  edge  lengths  divisible 
by  the  processor  grid  edge  lengths.  In  a  cyclic  layout,  the  amount  of  padding  required  due  to  symmetry  and 
divisibility  grows  with  the  number  of  processors,  presenting  a  potential  limit  to  strong  scaling  (e.g.  if  the 
number  of  processors  is  equal  to  the  number  of  tensor  elements,  cyclic  and  blocked  layouts  arc  the  same  and 
no  structure  is  preserved).  However,  the  relative  padding  overhead  decreases  as  we  increase  the  tensor  size, 
so  we  expect  the  cyclic  layout  to  at  least  maintain  good  weak  scalability. 

4.2  Overdecomposition 

Our  cyclic  decomposition  allows  us  to  preserve  symmetric  fill  structure  so  long  as  we  satisfy  the  requirement 
that  all  symmetric  tensor  indices  arc  decomposed  with  the  same  cyclic  phase.  This  requirement  presents 
a  significant  challenge  to  the  naive  method  of  assigning  a  single  block  to  each  processor,  since  for  a  n- 
dimensional  symmetric  index  group,  it  would  require  the  number  of  processors  to  have  an  integer  nth  root. 
This  restriction  could  be  overcome  by  using  a  subset  of  the  processors  to  do  the  computation,  but  this  would 
sacrifice  efficiency  and  generality.  Our  solution  is  to  add  a  layer  of  abstraction  between  the  decomposition 
and  the  machine  by  assigning  multiple  sub-tensors  to  each  processor. 

Cyclops  Tensor  Framework  has  the  capability  to  overdecompose  the  tensor  into  a  number  of  blocks  that 
is  a  multiple  of  the  number  of  processors.  We  can  represent  a  (/-dimensional  torus  processor  grid  as  a  (/-tuple 
(pi, . .  -Pd),  where  pd  €  {1,  2, . . .}.  In  CTF,  each  n-dimensional  tensor  has  n  mappings  {m\  . . .  mn)  for 
each  index,  where  each  mapping  consists  of  a  list  of  processor  grid  dimension  p,  and  an  integer  overdecom¬ 
position  factor  v,  so  that  m,  =  (v,  q)  where  r  £  {1,2,.. q  £  {0, 1, . . .  d}.  Defining  po  =  k  the  cyclic 
phase  of  the  tensor  sub-block  along  dimension  i  is  then  given  by  v  ■  pq. 

CTF  attempts  to  use  the  least  amount  of  overdecomposition,  since  larger  blocks  typically  achieve  higher 
matrix  multiplication  efficiency.  The  overdecomposition  factor  is  then  set  to  be  equal  to  the  least  common 
multiple  (1cm)  of  the  physical  dimensions  to  which  indices  in  the  symmetric  group  arc  mapped.  For  instance 
if  a  tensor  has  a  3-index  symmetry  with  the  first  two  indices  mapped  along  processor  grid  dimensions  pi 
and  p2,  we  will  have  the  three  indices  mapped  as 

mi  =  (lcm(pi,p2)/pi,  1) 
m2  =  (lcm(p1,p2)/p2,  2) 
m3  =  (lcm(pi,p2),0). 

We  note  that  the  overall  phase  of  these  mappings  is  the  same,  so  the  symmetric  fill  structure  of  the  index 
group  is  preserved  within  each  sub-tensor  block.  Further,  we  see  that  overdecomposition  allows  a  tensor  to 
be  distributed  among  any  number  of  processors  (in  this  case  6),  whereas  a  naive  decomposition  would  have 
required  a  processor  count  of  {1,  8,  27, . . .},  and  would  have  very  limited  capability  for  topology-aware 
mapping. 

We  do  not  use  a  dynamically  scheduled  overdecomposition  approach  such  as  that  of  the  Charm++  run¬ 
time  system  [24].  Instead,  our  overdecomposition  is  set  so  that  the  dimensions  of  the  cyclic  decomposition 
are  a  multiple  of  the  physical  torus  dimensions  (by  the  factors  v  as  above)  and  generate  a  regular  map¬ 
ping.  For  dense  tensors,  our  approach  maintains  perfect  load-balance  and  achieves  high  communication 
and  task  granularity  by  managing  each  overdecomposed  sub-grid  of  blocks  explicitly  within  each  processor. 
However,  we  are  exploring  the  utility  of  dynamic  scheduling  in  allowing  CTF  to  efficiently  support  sparse 
tensors. 
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4.3  Redistribution  of  data 


As  we  have  seen  with  overdecomposition,  the  symmetries  of  a  tensor  place  requirements  on  the  cyclic  de¬ 
composition  of  the  tensor.  Further  parallel  decomposition  requirements  arise  when  the  tensor  participates 
in  a  summation  or  contraction  with  other  tensors,  in  order  to  properly  match  up  the  index  mappings  among 
the  tensors.  Therefore,  it  is  often  necessary  to  change  the  mapping  of  the  tensor  data  between  each  contrac¬ 
tion.  In  general,  we  want  to  efficiently  transform  the  data  between  any  two  given  index  to  processor  grid 
mappings. 

Further,  redistribution  of  data  is  necessary  if  the  framework  user  wants  to  read  or  write  parts  of  the 
tensor.  We  enable  a  user  to  read/write  to  any  set  of  global  indices  from  any  processor  in  a  bulk  synchronous 
sparse  read/write  call.  Additionally,  CTF  allows  the  user  to  extract  any  sub-block  from  a  tensor  and  define  it 
as  a  new  CTF  tensor.  CTF  has  three  redistribution  kernels  of  varying  generality  and  efficiency,  all  of  which 
arc  used  in  the  execution  of  coupled-cluster  calculations.  The  functionality  of  each  kernel  is  summarized  as 
follows  (the  kernels  are  listed  in  order  of  decreasing  generality  and  increasing  execution  speed) 

•  sparse  redistribution  -  write  or  read  a  sparse  set  of  data  to  or  from  a  dense  mapped  CTF  tensor 

•  dense  redistribution  -  shuffles  the  data  of  a  dense  tensor  from  one  CTF  mapping  to  any  other  CTF 
mapping 

•  block  redistribution  -  shuffling  the  assignment  of  the  tensor  blocks  to  processors  without  changing 
the  ordering  of  data  within  tensor  blocks 

Of  these  three  kernels,  the  second  is  the  most  complex,  the  most  optimized,  and  is  used  the  most  intensively 
during  CCSD/CCSDT  calculations.  The  third  kernel  serves  as  a  faster  version  of  dense  redistribution.  The 
sparse  redistribution  kernel  currently  serves  for  input  and  output  of  tensor  data,  but  in  the  future  could  also 
be  used  internally  to  execute  sparse  tensor  contractions.  We  detail  the  architecture  of  each  kernel  below. 

4.3.1  Sparse  redistribution 

There  arc  multiple  reasons  for  CTF  to  support  sparse  data  redistribution,  including  data  input  and  output, 
verification  of  the  dense  redistribution  kernel,  and  support  for  sparse  tensor  storage.  Our  algorithm  for 
reading  and  writing  sparse  data  operates  on  a  set  of  key-value  pairs  to  read/write.  The  key  is  the  global  data 
index  in  the  tensor.  For  a  tensor  of  dimension  d  and  edge  lengths  (Zi ,  Z2 ,  -  -  - ,  Id),  the  value  at  the  position 
described  by  the  tuple  {h,i2,  ■  ■  ■  ,h),  0  <  ik  <  h  ■  V/c  is  given  by, 

d  /  k- 1  \ 

^global  =  ^  ]  I  i'k  |  lm  I  • 
fc=l  \  m=  1  / 


The  sparse  algorithm  is  as  follows: 

1 .  sort  the  keys  by  global  index 

2.  place  the  local  key/value  pairs  into  bins  based  on  the  indices’  block  in  the  final  distribution 

3.  collect  the  bins  for  each  block  into  bins  for  each  destination  processor 

4.  exchange  keys  among  processors  via  all-to-all  communication 

5.  combine  the  received  keys  for  each  block  and  sort  them  by  key  (the  data  is  in  global  order  within  a 
block,  but  not  across  multiple  blocks) 
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6.  iterate  over  the  local  data  of  the  target  tensor,  computing  the  key  of  each  piece  of  data  and  performing 
a  read  or  write  if  the  next  sparse  key  is  the  same 

7.  if  the  operation  is  a  read,  send  the  requested  values  back  to  the  processors  which  requested  them 

This  sparse  read/write  algorithm  can  be  employed  by  a  kernel  which  goes  between  two  mappings  of  a  tensor 
by  iterating  over  local  data  and  turning  it  into  sparse  format  (key-value  pairs),  then  calling  a  sparse  write 
on  a  zero  tensor  in  the  new  mapping.  Further,  by  extracting  only  the  values  present  in  the  sparse  block  of  a 
tensor  and  changing  the  offsets  of  the  keys,  we  allow  general  sub-block  to  sub-block  redistribution. 

This  redistribution  kernel  is  very  general  and  fairly  straightforward  to  implement  and  thread,  however, 
it  clearly  does  redundant  work  and  communication  for  a  redistribution  between  two  mappings  due  to  the 
formation,  sorting,  and  communication  of  the  keys.  If  the  amount  of  data  stored  locally  on  each  processor  is 
n,  this  algorithm  requires  ()(n  log  n)  local  work  and  memory  traffic  for  the  sort.  The  next  kernel  we  detail 
is  specialized  to  perform  this  task  in  a  more  efficient  manner. 

4.3.2  Dense  redistribution 

Between  contractions,  it  is  often  the  case  that  a  CTF  tensor  must  migrate  from  one  mapping  to  another. 
These  mappings  can  be  defined  on  processor  grids  of  different  dimension  and  can  have  different  overall 
cyclic  phases  along  each  dimension.  This  change  implies  that  the  padding  of  the  data  also  changes,  and  it 
becomes  necessary  to  move  only  the  non-padded  data.  So,  we  must  iterate  over  the  local  data,  ignoring  the 
padding,  and  send  each  piece  of  data  to  its  new  destination  processor,  then  recover  the  data  on  the  destination 
processor  and  write  it  back  to  the  new  local  blocks  in  the  proper  order. 

Our  dense  redistribution  kernel  utilizes  the  global  ordering  of  the  tensor  to  avoid  forming  or  commu¬ 
nicating  keys  with  the  index  of  data  values.  The  kernel  places  the  values  into  the  send  buffers  in  global 
order,  performs  the  all-to-all  then  retrieves  them  from  the  received  buffers  using  the  knowledge  that  the  data 
is  in  global  order  and  thereby  computing  which  processor  sent  each  value.  The  same  code  with  swapped 
parameters  is  used  to  compute  the  destination  processor  from  the  sender  side,  as  well  as  to  compute  the 
sender  processor  from  the  destination  side.  The  kernel  is  threaded  by  partitioning  the  tensor  data  among 
threads  according  to  global  order.  We  ensure  a  load-balanced  decomposition  of  work  between  threads,  by 
partitioning  the  global  tensor  in  equal  chunks  according  to  its  symmetric-packed  layout,  and  having  each 
thread  work  on  the  local  paid  of  this  global  partition,  which  is  now  balanced  because  the  layout  is  cyclic. 
Another  important  optimization  to  the  kernel,  which  significantly  lowered  the  integer-arithmetic  cost,  was 
precomputing  the  destination  blocks  along  each  dimension  of  the  tensor.  For  instance,  to  redistribute  an 
n-  by  -n  matrix  from  a  pr-by-pc  grid  to  another  distribution,  we  precompute  two  destination  vectors  v  and  w 
of  size  n/pr  and  n/pc,  respectively,  on  each  processor,  which  allow  us  to  quickly  determine  the  destination 
of  local  matrix  element  a,?  via  look  ups  to  v1:  and  Wj. 

The  code  necessary  to  implement  this  kernel  is  complex  because  it  requires  iterating  over  local  tensor 
blocks  data  in  global  order,  which  requires  striding  over  blocks,  making  the  index  arithmetic  complex  and 
the  access  pattern  non-local.  However,  overall  the  algorithm  performs  much  less  integer  arithmetic  than 
the  sparse  redistribution  kernel,  performing  0(n )  work  and  requiring  ()(n)  memory  reads  (although  with 
potentially  with  less  locality  than  sorting),  if  the  local  data  size  is  n.  Further,  the  all-to-all  communication 
required  for  this  kernel  does  not  need  to  communicate  keys  along  with  the  data  like  in  the  sparse  redistri¬ 
bution  kernel.  In  practice,  we  found  the  execution  time  to  be  roughly  10X  faster  than  the  sparse  kernel. 
Nevertheless,  these  dense  redistributions  consume  a  noticeable  fraction  of  the  execution  time  of  CTF  during 
most  CC  calculations  we  have  tested. 
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4.3.3  Block  redistribution 

Often,  it  is  necessary  to  perform  a  redistribution  that  interchanges  the  mapping  assignments  of  tensor  di¬ 
mensions.  For  instance,  if  we  would  like  to  symmetrize  or  desymmetrize  a  tensor,  we  need  to  add  the  tensor 
to  itself  with  two  indices  transposed.  In  a  distributed  memory  layout,  this  requires  a  distribution  if  either 
of  the  two  indices  are  mapped  onto  a  processor  grid  dimension.  If  the  tensor  was  in  a  mapping  where  the 
two  indices  had  the  same  cyclic  phase  (it  had  to  if  the  indices  were  symmetric),  it  suffices  to  redistribute  the 
blocks  of  the  tensor.  With  this  use-case  in  mind,  we  wrote  a  specialized  kernel  that  goes  from  any  pair  of 
mappings  with  the  same  overall  cyclic  phases  and  therefore  the  same  blocks. 

We  implemented  this  type  of  block  redistribution  by  placing  asynchronous  receive  calls  for  each  block  on 
each  processor,  sending  all  the  blocks  from  each  processor,  and  waiting  for  all  the  exchanges  to  finish.  Since 
blocks  arc  stored  contiguously  they  arc  already  serialized,  so  this  kernel  requires  no  local  computational 
work  and  has  a  low  memory-bandwidth  cost.  However,  this  kernel  still  incurs  the  network  communication 
cost  of  an  all-to-all  data  exchange.  This  block  redistribution  kernel  is  used  whenever  possible  in  place  of  the 
dense  redistribution  kernel. 

5  Algorithms  for  tensor  contraction 

In  the  previous  section  we’ve  established  the  mechanics  of  how  CTF  manages  tensor  data  redistribution 
among  mappings.  This  section  will  focus  on  detailing  the  algorithms  CTF  uses  to  select  mappings  for 
contractions  and  to  perform  the  tensor  contraction  computation.  We  start  by  discussing  what  happens  to 
tensor  symmetries  during  a  contraction  and  how  symmetry  may  be  exploited.  Then  we  detail  how  the 
intra-node  contraction  is  performed  and  how  the  inter-node  communication  is  staged.  Lastly,  we  detail  how 
full  contraction  mappings  arc  automatically  selected  based  on  performance  models  of  redistribution  and 
distributed  contraction. 

5.1  Tensor  folding 

Any  tensor  contraction  of  A  6  and  B  6  M/f,i  X-Ib™  into  C  6  WJci  is  an  assignment  of 

index  labels  to  each  dimension  of  the  three  tensors,  such  that  every  label  is  assigned  to  two  dimensions  in 
different  tensors.  This  labeling  is  a  tripartite  graph  G  =  ( V.  E),  with  partitions  V  =  V a  U  V'r  U  Vq,  where 
each  vertex  is  a  tensor  dimension  corresponding  to  tensor  indices  ia,  ib,  and  i(j.  Each  vertex  in  V  must 
be  adjacent  to  a  single  edge  in  E ,  which  corresponds  to  its  matching  index.  The  two  tensor  edge  lengths  of 
the  matched  index  must  be  the  same.  We  assign  t  =  \E\  =  |Vj/2  =  (l  +  m  +  n)/ 2  labels  to  the  set  of 
edges  E,  ie,  and  define  three  projection  and  permutation  mappings  p,\  6  ),e  — >•  M,  pn  €  ij?  ip>,  and 
PC  F  —■ ►  ic-  These  projection  mappings  (indices)  now  completely  define  the  element-wise  contraction 
over  the  index  space  {I i, . . .  /)  }, 

\He  €  {1j  •  •  •  II}  x  .  .  .  {1,  .  .  .  It}  ■  cpc(iE)  =  cpc(iE )  “1“  apA Fi? )  ' 

For  example,  matrix  multiplication  (C  =  C  +  A  •  B)  may  be  defined  by  mappings  p  ,\  =  ( i,j ,  k )  —>  {i,  k), 
Pb  =  — G  (k,  j)  and  pc  =  ( i,j,k )  — >  ( i,j ).  These  projections  are  provided  to  the  framework  via 

the  domain  specific  language  in  Section  3.1.  The  code  is  in  fact  somewhat  more  general  than  the  definition 
of  contractions  we  analyze  here,  as  it  allows  the  same  index  to  appeal-  in  all  three  tensors  (weigh  operation) 
as  well  as  only  one  of  three  tensors  (reduction). 

Tensor  contractions  reduce  to  matrix  multiplication  via  index  folding.  Index  folding  corresponds  to 
transforming  sets  of  indices  into  larger  compound  indices,  and  may  necessitate  transposition  of  indices.  We 
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define  a  folding  of  index  set  s  =  {ii, . . .  in }  into  a  single  compound  index  q  as  a  one-to-one  map 

n 

i= 1 

for  instance  with  s  =  {i ,  j,  k},  we  have  q  =  f(s)  =  i  + 1 \  ■  j  + 1  \  ■  I-2-  k.  Given  a  two-way  partition  function 
(7Ti(s),  7T2(s)),  and  two  index  set  foldings  /  and  g,  we  define  a  matrix  folding  m(s)  =  (/(7Ti(s)),  <7(^2  (s))) 
as  a  folding  of  a  tensor’s  indices  into  two  disjoint  compound  indices.  Therefore,  A  =  m(  A)  with  elements 

®m(s) 

Any  contraction  can  be  folded  into  matrix  multiplication  in  the  following  manner,  by  defining  mappings 
f  An-  fnc-,  and  fAC  and  matrix  foldings 

mA(iA )  =  ( fAc{iAc),  fAB(iAB )), 
msilB)  =  (/as(*ab),/sc'(*bc')), 
mc(ic)  =  (fAc(iAc),  fBc(iBc)), 

where  ix  =  Px^-e)  for  all  X  G  {A,  B,C}  and  ixY  =  Px(ie)  H  py{ie)  for  all  X,  Y  G  {A,  B,  C}. 
Now  the  contraction  may  be  computed  as  a  matrix  multiplication  of  A  =  mA(A)  with  B  =  ms(B)  into 

C  =  mc(C) 

Cij  —  Cij  T  ^  "  O-ik  '  bkj 

k 

Tensors  can  also  have  symmetry,  we  denote  antisymmetric  (skew-symmetric)  index  groups  as 


[ii  .  -in\  [ii -ij 

for  any  j,  k  G  [1,  n\.  For  the  purpose  of  this  analysis,  we  will  only  treat  antisymmetric  tensors,  for  symmetric 
tensors  the  non-zero  diagonals  require  more  special  consideration.  Using  the  notation  I®n  =  I  x  •  •  •  x  I, 

(n—  1) -times 

we  denote  a  packed  (folded)  antisymmetric  compound  index  as  an  onto  map  from  a  packed  set  of  indices  to 
a  interval  of  size  binomial  in  the  tensor  edge  length 

so  given  a  simple  contraction  of  antisymmetric  tensors,  such  as, 

chi~  •  A  — s] , [A  — s+l  * *'^m]  ^hl  A  —  s] , [fl  ■■•■7s]  ^[jl  ■  "Js]  >  [A  —  s+1  ••  -im]  ’ 

h-js 

we  can  compute  it  in  packed  antisymmetric  layout  via  matrix  foldings 

mc({ii,  •  =  (f(k,  ■  ..ik-s),g{ik  — s+1 )  •  •  •  irn)') j 

mA({h,  ■  ..ik-s,ji,  •  •  • js })  =  (/Oi,  •  ■ -ik-s),h{ji,  ■  ■  ■ js )), 

mB({j l,  •  ■  -jsAks+iAm})  =  (h(ji, .  -.js),g(ik  — s+l>  •  •  •  *m))- 


So  that,  for  A  =  mA( A),  B  =  ms(B),  and  C  =  mc(C)  , 


V{*i  •  •  •  ik-s]  €  {1, . . .  I}®k~s,  {ik-s+i  ■  •  •  im}  €  {1,  •  •  • 

{jl...Js}G{l,.../r  : 

Cmc({h,— im})  =  Gnc({u  ■A— sdl  vfs})  (tfl >  ■  ■  •  A  h/c — 3+ 1  hm  }  ) 


Vi  G 


,3  e 


/ 

m  —  k  +  s 


o 

Gj  S  ■  ^  "  O-ik  '  ^kj  ■ 
k= 1 
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The  above  contraction  is  an  example  where  all  symmetries  arc  preserved.  Any  preserved  symmetries  must  be 
symmetries  of  each  tensor  within  the  whole  contraction  term.  We  can  consider  a  tensor  Z  corresponding  to 
the  uncontracted  set  of  scalar  multiplications  whose  entries  arc  pairs  associated  with  scalar  multiplications, 

^jl—jaik-s+l—im)' 

If  Z  is  symmetric  in  a  pair  of  indices,  we  call  this  symmetry  preserved.  Broken  symmetries  arc  symmetries 
which  exist  in  one  of  A,  B,  or  C,  but  not  in  Z.  For  example,  we  can  consider  the  contraction 

c[ij]kl  =  22  a[ij][pq]  '  bpk[ql] 
pq 

which  corresponds  to  contraction  graph  elements  zupkipq.  The  symmetry  \ij]  is  preserved  but  the  symme¬ 
tries  [pq]  and  \ql]  are  broken.  While  for  each  preserved  symmetry  in  a  contraction  we  achieve  a  reduction 
in  floating  point  operations,  for  broken  symmetries  we  currently  only  exploit  preservation  of  storage.  The 
broken  symmetries  can  be  unpacked  and  the  contraction  computed  as 

Cf(i,j),k,l  =  22  af(i,j),P,q  '  hP’kA’1 

p,q 

or  the  broken  symmetries  can  remain  folded,  in  which  case  multiple  permutations  arc  required, 

—  22  f(i,j),g(p,q)  '  b p,k,h(q,l )  ® f  (i,j) ,g(p,q)  '  ^p,k,h(l,q) 

p<q 

~  22  [®/(*d),s(g,p)  '  bp,k,h(q,l)  ~  ® f(i,j),g(q,p )  '  ^ p,k,h(l,q ) 

q<p 

Our  framework  makes  dynamic  decisions  to  unpack  broken  symmetries  in  tensors  or  to  perform  the  packed 
contraction  permutations,  based  on  the  amount  of  memory  available.  Unpacking  each  pair  of  indices  is 
done  via  a  tensor  summation  which  requires  either  a  local  transpose  of  data  of  a  global  transpose  with  inter¬ 
processor  communication,  which  can  usually  be  done  via  a  block-wise  redistribution  which  was  detailed  in 
Section  4.3.3.  It  may  be  possible  to  efficiently  reduce  or  avoid  the  desymmetrization  and  symmetrization 
calls  necessary  to  deal  with  broken  symmetries  and  we  arc  currently  exploring  possible  new  algorithms  that 
can  achieve  this.  However,  currently  unpacking  leads  to  highest  efficiency  due  to  the  ability  to  fold  the 
unpacked  indices  into  the  matrix  multiplication  call. 

5.2  On-node  contraction  of  tensors 

To  prepare  a  folded  form  for  the  on-node  contractions,  we  perform  non-symmetric  transposes  of  each  tensor 
block.  In  particular,  we  want  to  separate  out  all  indices  which  take  part  in  a  broken  symmetric  group  within 
the  contraction  and  linearize  the  rest  of  the  indices.  If  a  symmetry  group  is  not  broken,  we  can  simply  fold 
the  symmetric  indices  into  one  bigger  dimension  linearizing  the  packed  layout.  We  perform  an  ordering 
transposition  on  the  local  tensor  data  to  make  dimensions  which  can  be  folded,  as  fastest  increasing  and  the 
broken  symmetric  dimensions  as  slowest  increasing  within  the  tensors.  To  do  a  sequential  contraction,  we 
can  then  iterate  over  the  broken  symmetric  indices  (or  unpack  the  symmetry)  and  call  matrix  multiplication 
over  the  linearized  indices.  For  instance,  the  contraction  from  the  start  of  this  section, 

c[ij]kl  =  22  aM  M  ’  bpk[ql] 
pq 
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would  be  done  as  a  single  matrix  multiplication  for  each  block,  if  the  [pq]  and  [ql]  symmetries  arc  unpacked. 
However,  if  all  the  broken  symmetries  arc  kept  folded,  the  non-symmetric  transpose  would  make  fastest  in¬ 
creasing  the  folded  index  corresponding  to  as  well  as  the  k  index,  so  that  the  sequential  kernel  could 

iterate  over  p,  q,  l  and  call  a  vector  outer-product  operation  over  f(i,j ),  k  for  each  p.  q.  1.  The  underlying 
call  is  generally  a  matrix  multiplication,  though  in  this  case  it  reduces  to  an  outer-product  of  two  vectors, 
and  in  other  cases  could  reduce  to  a  vector  or  scalar  operation.  No  matter  what  the  mapping  of  the  tensor  is, 
the  non-symmetric  transpose  required  to  fold  the  tensor  blocks  can  be  done  locally  within  each  block  and 
requires  no  network  communication.  Therefore,  while  the  non-symmetric  permutations  present  an  overhead 
to  the  sequential  performance,  their  cost  decreased  comparatively  as  we  perform  strong  or  weak  scaling. 

5.3  Distributed  contraction  of  tensors 

In  CTF,  tensors  arc  distributed  in  cyclic  layouts  over  a  virtual  torus  topology.  The  distributed  algorithm  for 
tensor  contractions  used  by  CTF  is  a  combination  of  replication,  as  done  in  3D  and  2.5D  algorithms  [50]  and 
a  nested  SUMMA  [1,  53]  algorithm.  If  the  dimensions  of  two  tensors  with  the  same  contraction  index  arc 
mapped  onto  different  torus  dimensions,  a  SUMMA  algorithm  is  done  on  the  plane  defined  by  the  two  torus 
dimensions.  For  each  pair  of  indices  mapped  in  this  way,  a  nested  level  of  SUMMA  is  done.  The  indices 
must  be  mapped  with  the  same  cyclic  phase  so  that  the  blocks  match,  adding  an  extra  mapping  restriction 
which  necessitates  overdecomposition. 

We  employ  three  versions  of  SUMMA,  each  one  communicates  a  different  pair  of  the  the  three  tensors 
and  keeps  one  tensor  in  place.  The  operand  tensors  arc  broadcast  before  each  subcontraction,  while  the  out¬ 
put  tensor  is  reduced  after  each  subcontraction.  So,  in  one  version  of  SUMMA  (the  outer-product  version) 
two  broadcasts  arc  done  (one  of  each  operand  tensor  block)  and  in  the  other  two  of  the  three  versions  of 
SUMMA  one  reduction  is  done  (instead  of  one  of  the  broadcasts)  to  accumulate  partial  sums  to  blocks  of 
the  output.  If  the  processor  dimensions  onto  which  the  pair  of  indices  corresponding  to  a  SUMMA  call 
are  mapped  arc  of  different  lengths,  the  indices  must  be  overdecomposed  to  the  same  cyclic  phase,  and  our 
SUMMA  implementation  does  as  many  communication  steps  as  the  cyclic  phase  of  the  indices. 

Blocking  the  computation  on  3D  grids  (replicating  over  one  dimension)  allows  for  algorithms  which 
achieve  better  communication  efficiency  in  the  strong  scaling  limit.  Strong  scaling  efficiency  is  necessary 
for  higher  order  CC  methods  which  require  many  different  small  contractions  to  be  computed  rapidly.  When 
a  small  contraction  is  computed  on  a  large  number  of  processors  there  is  typically  much  more  memory  is 
available  than  the  amount  necessary  to  store  the  tensor  operands  and  output.  The  additional  memory  can 
be  exploited  via  replication  of  tensor  data  and  reduction  of  overall  communication  volume.  We  always 
replicate  the  smallest  one  of  the  three  tensors  involved  in  the  contraction  to  minimize  the  amount  of  memory 
and  communication  overhead  of  replication. 

5.4  Topology-aware  network  mapping 

Each  contraction  can  place  unique  restrictions  on  the  mapping  of  the  tensors.  In  particular,  our  decompo¬ 
sition  needs  all  symmetric  tensor  dimensions  to  be  mapped  with  the  same  cyclic  phase.  Further,  we  must 
satisfy  special  considerations  for  each  contraction,  that  can  be  defined  in  terms  of  indices  (we  will  call  them 
paired  tensor  dimensions)  which  are  shared  by  a  pair  of  tensors  in  the  contraction.  These  considerations  arc 

1 .  dimensions  which  arc  paired  must  be  mapped  with  the  same  phase 

2.  for  the  paired  tensor  dimensions  which  arc  mapped  to  different  dimensions  of  the  processor  grid  ( arc 
mismatched) 

(a)  the  mappings  of  two  pairs  of  mismatched  dimensions  cannot  share  dimensions  of  the  processor 
grid 
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(b)  the  subspace  formed  by  the  mappings  of  the  mismatched  paired  dimensions  must  span  all  input 
data 

When  the  physical  network  topology  is  a  d-dimensional  toroidal  partition  specified  by  the  d-tuple 
(p\ , . . .  p(j).  CTF  considers  all  foldings  of  physical  topology  that  preserve  the  global  ordering  of  the  proces¬ 
sors  (e.g.  p2  may  be  folded  with  pn  but  not  only  with  7)4).  If  the  physical  network  topology  is  not  specified 
as  a  torus,  CTF  factorizes  the  number  of  processes  up  to  8  factors  and  treats  the  physical  topology  as  an 
8-dimensional  processor  grid  and  attempts  to  map  to  all  lower-dimensional  foldings  of  it.  When  there  are 
more  tensor  indices  than  processor  grid  dimensions,  additional  grid  dimensions  of  length  1  are  added. 


Figure  2:  Overdecomposition  as  used  in  CTF  to  perform  contractions.  This  diagram  demonstrates  a  mapping 
for  a  contraction  of  the  form  =  V ;  «[•/&;]  •  b[ij}-  In  this  case,  we  have  a  4-by-2  processor  grid,  and  a 
4-by-4-by-4  set  of  blocks. 

For  any  given  topology,  CTF  attempts  to  map  all  three  possible  pairs  of  tensors  so  that  their  indices  are 
mismatched  on  the  processor  grid  and  they  are  communicated  in  the  contraction  kernel.  The  mapping  of  two 
tensors  automatically  defines  the  mapping  of  the  third,  though  the  overdecomposition  factor  must  be  adjusted 
to  satisfy  all  symmetries  and  matching  indices  among  tensors.  CTF  also  considers  partial  replication  of 
tensors  among  the  processor  grid  dimensions.  Figure  2  shows  an  example  of  an  overdecomposed  mapping 
with  a  properly  matched  cyclic  phase  along  all  tensor  dimensions. 

The  search  through  mappings  is  done  entirely  in  parallel  among  processors,  then  the  best  mapping  is 
selected  across  all  processors.  The  mapping  logic  is  done  without  reading  or  moving  any  of  the  tensor  data 
and  is  generally  composed  of  integer  logic  that  executes  in  an  insignificant  amount  of  time  with  respect 
to  the  contraction.  We  construct  a  ’ghost’  mapping  for  each  valid  topology  and  each  ordering  of  tensors. 
The  distributed  contraction  algorithm  is  constructed  on  each  ghost  mapping,  and  its  communication  and 
memory  overheads  are  evaluated.  If  the  ghost  mapping  is  suboptimal  it  is  thrown  out  without  ever  dictating 
data  movement.  Once  a  mapping  is  decided  upon,  the  tensors  are  redistributed.  Currently,  our  performance 
cost  models  predominantly  consider  communication,  with  controls  on  the  amount  of  overdecomposition  and 
memory  usage.  We  are  seeking  to  extend  this  cost  model  and  to  add  infrastructure  for  training  the  parameters 
of  our  performance  models. 

6  Application  performance 

The  implementation  of  CTF  employs  MPI  [17]  for  inteiprocessor  communication,  BLAS  for  matrix  multi¬ 
plication  and  summation,  as  well  as  OpenMP  for  threading.  All  other  functionalities  in  CTF  were  developed 
from  scratch  and  have  no  other  dependencies.  We  used  vendor  provided  optimized  on-node  parallel  BLAS 
implementations  (IBM  ESSL  and  Cray  LibSci)  on  all  architectures  for  benchmarking.  While  the  interface 
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is  C++,  much  of  the  internals  of  CTF  are  in  C-style  with  C++  library  support.  Outside  of  special  network 
topology  considerations  on  BlueGene/Q,  CTF  does  not  employ  any  optimizations  which  arc  specific  to  an 
architecture  or  an  instruction  set.  Performance  profiling  is  done  by  hand  and  with  TAU  [48]. 

6.1  Architectures 

Cyclops  Tensor  Framework  targets  massively  parallel  architectures  and  is  designed  to  take  advantage  of 
network  topologies  and  communication  infrastructure  that  scale  to  millions  of  nodes.  Parallel  scalability 
on  commodity  clusters  should  benefit  significantly  from  the  load  balanced  characteristics  of  the  workload, 
while  high-end  supercomputers  will  additionally  benefit  from  reduced  inter-processor  communication  which 
typically  becomes  a  bottleneck  only  at  very  high  degrees  of  parallelism.  We  collected  performance  results 
on  two  state-of-the-art  supercomputer  architectures,  IBM  Blue  Gene/Q  and  Cray  XC30.  We  also  tested 
sequential  and  multi-threaded  performance  on  a  Xeon  desktop. 

The  sequential  and  non-parallel  multi-threaded  performance  of  CTF  is  compared  to  NWChem  and 
MRCC.  The  platform  is  a  commodity  dual-socket  quad-core  Xeon  E5620  system.  On  this  machine,  we  used 
the  sequential  and  threaded  routines  of  the  Intel  Math  Kernel  Library.  This  platform,  as  well  as  the  problem 
sizes  tested  reflect  a  typical  situation  for  workloads  on  a  workstation  or  small  cluster,  which  is  where  the 
sequential  performance  of  these  codes  is  most  important.  Three  problem  sizes  are  timed,  spanning  a  variety 
of  ratios  of  the  number  of  virtual  orbitals  to  occupied  orbitals. 

The  second  experimental  platform  is  ‘Edison’,  a  Cray  XC30  supercomputer  with  two  12-core  Intel  “Ivy 
Bridge”  processors  at  2.4GHz  per  node  (19.2  Gflops  per  core  and  460.8  Gflops  per  node).  Edison  has  a 
Cray  Aries  high-speed  interconnect  with  Dragonfly  topology  (0.25 /is  to  3.7 fis  MPI  latency,  8  GB/sec  MPI 
bandwidth).  Each  node  has  64  GB  of  DDR3  1600  MHz  memory  (four  8  GB  DIMMs  per  socket)  and  two 
shared  30  MB  L3  caches  (one  per  Ivy  Bridge).  Each  core  has  its  own  LI  and  L2  caches,  of  size  64  KB  and 
256  KB. 

The  final  platform  we  consider  is  the  IBM  Blue  Gene/Q  (BG/Q)  architecture.  We  use  the  installations 
at  Argonne  and  Lawrence  Livermore  National  Laboratories.  On  both  installations,  IBM  ESSL  was  used 
for  BLAS  routines.  BG/Q  has  a  number  of  novel  features,  including  a  5D  torus  interconnect  and  16-core 
SMP  processor  with  4-way  hardware  multi-threading,  transactional  memory  and  L2-mediated  atomic  opera¬ 
tions  [19],  all  of  which  serve  to  enable  high  performance  of  the  widely  portable  MPI/OpenMP  programming 
model.  The  BG/Q  cores  run  at  1.6  GHz  and  the  QPX  vector  unit  supports  4-way  fused  multiply-add  for  a 
single-node  theoretical  peak  of  204.8  GF/s.  The  BG/Q  torus  interconnect  provides  2  GB/s  of  theoretical 
peak  bandwidth  per  link  in  each  direction,  with  simultaneous  communication  along  all  10  links  achieving 
35.4  GB/s  for  1  MB  messages  [10]. 

6.2  Results 

We  present  the  performance  of  a  CCSD  and  CCSDT  implementation  contained  within  Aquarius  and  exe¬ 
cuted  using  Cyclops  Tensor  Framework.  For  each  contraction,  written  in  one  line  of  Aquarius  code,  CTF 
finds  a  topology-aware  mapping  of  the  tensors  to  the  computer  network  and  performs  the  necessary  set  of 
contractions  on  the  packed  structured  tensors. 

6.2.1  Sequential  CCSD  performance 

The  results  of  the  sequential  and  multi-threaded  comparison  arc  summarized  in  Table  1.  The  time  per 
CCSD  iteration  is  lowest  for  NWChem  in  all  cases,  and  similarly  highest  for  MRCC.  The  excessive  iteration 
times  for  MRCC  when  the  —  ratio  becomes  small  reflect  the  fact  that  MRCC  is  largely  memory-bound, 
as  contractions  are  performed  only  with  matrix-vector  products.  The  multi-threaded  speedup  of  CTF  is 
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significantly  better  than  NWChem,  most  likely  due  to  the  lack  of  multi-threading  of  tensor  transposition  and 
other  non-contraction  operations  in  NWChem. 

Table  1:  Sequential  and  non-parallel  multi-threaded  performance  comparison  of  CTF,  NWChem,  and 
MRCC.  Entries  are  average  time  for  one  CCSD  iteration,  for  the  given  number  of  virtual  (nv)  and  occu¬ 
pied  (n0)  orbitals. 


nv  =  110 
nQ  =  5 

nv  =  94 
nQ  =  11 

nv  =  71 
nQ  =  23 

NWChem 

1  thread 

6.80  sec 

16.8  sec 

49.1  sec 

CTF 

1  thread 

23.6  sec 

32.5  sec 

59.8  sec 

MRCC 

1  thread 

31.0  sec 

66.2  sec 

224.  sec 

NWChem 

8  threads 

5.21  sec 

8.60  sec 

18.1  sec 

CTF 

8  threads 

9.12  sec 

9.37  sec 

18.5  sec 

MRCC 

8  threads 

67.3  sec 

64.3  sec 

86.6  sec 

Strong  scaling  CCSD  on  Edison 


Strong  scaling  CCSDT  on  Edison 


#nodes 


(a)  CCSD 


(b)  CCSDT 


Figure  3:  CCSD  and  CCSDT  strong  scaling  of  water  clusters  with  cc-pVDZ  basis  set  on  Edison  (Cray 
XC30). 


6.2.2  Strong  scaling 

On  the  Cray  XC30  machine,  we  compared  the  performance  of  our  CCSD  implementation  with  that  of 
NWChem.  We  benchmarked  the  two  codes  for  a  series  of  water  systems  -  wn,  where  n  is  the  number 
of  water  monomers  in  the  system.  Water  clusters  give  a  flexible  benchmark  while  also  representing  an 
interesting  chemical  system  for  their  role  in  describing  bulk  water  properties.  In  Figure  3(a),  we  compare 
the  scaling  of  CCSD  using  CTF  with  the  scaling  of  NWChem.  Our  version  of  NWChem  6.3  used  MPI-3 
and  was  executed  using  one  MPI  process  per  core.  The  tile  size  was  40  for  CCSD  and  20  for  CCSDT. 
The  CCSD  performance  achieved  by  CTF  becomes  significantly  higher  than  NWChem  when  the  number  of 
processors  used  for  the  calculation  is  increased.  CTF  can  both  solve  problems  in  shorter  time  than  NWChem 
(strong  scale)  and  solve  larger  problems  faster  (weak  scale).  Figure  3(b)  gives  the  CCSDT  strong  scaling 
comparison.  CCSDT  for  system  sizes  above  3  molecules  did  not  work  successfully  with  our  NWChem 
build,  but  a  large  performance  advantage  for  CTF  is  evident  for  the  smaller  systems. 
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Weak  scaling  on  BlueGene/Q 


Weak  scaling  on  Edison 


Figure  4:  CCSD  and  CCSDT  weak  scaling  on  water  clusters  with  cc-pVDZ  basis  set. 


Table  2:  A  performance  breakdown  of  important  kernels  for  a  CCSD  iteration  done  by  CTF  on  a  system 
with  n0  =  125  occupied  orbitals  and  nv  =  600  virtual  orbitals  on  256  nodes  of  Edison  (XC30)  and  1024 
nodes  of  Mira  (BG/Q).  Complexity  is  in  terms  of  p  processors  and  M  memory  per  processor. 


kernel 

BG/Q 

XC30 

complexity 

architectural  bounds 

matrix  mult. 

49% 

35% 

0(n$n2Jp ) 

flops/mem  bandwidth 

broadcasts 

24% 

37% 

0{nln20/pVM) 

multicast  bandwidth 

prefix  sum 

10% 

4% 

0(p) 

allreduce  bandwidth 

data  packing 

4% 

3% 

0(n2vn2Jp ) 

integer  ops 

all-to-all-v 

3% 

10% 

0(n2vn2Jp ) 

bisection  bandwidth 

tensor  folding 

6% 

5% 

0(n2n2/p) 

memory  bandwidth 

other 

4% 

6% 

6.2.3  Weak  scaling 

The  parallel  weak  scaling  efficiency  of  our  CCSD  and  CCSDT  implementation  on  Blue  Gene/Q  is  displayed 
in  Figure  4(a)  and  on  Cray  XC30  is  displayed  in  Figure  4(b).  This  weak  scaling  data  was  collected  by 
doing  the  largest  CCSD  and  CCSDT  run  that  would  fit  in  memory  on  each  node  count  and  normalizing  the 
efficiency  by  the  number  of  floating  point  operations  performed  by  CTF.  Going  from  512  to  16384  nodes 
(256K  cores),  the  efficiency  actually  often  increases,  since  larger  problems  can  be  solved,  which  increases 
the  ratio  of  computation  over  communication.  For  CCSD,  CTF  maintains  30  GF/node  on  BG/Q,  which 
is  about  one  sixth  of  peak.  On  Edison,  CCSD  surpasses  200  GF/node,  which  is  over  50%  of  peak.  The 
application  was  run  with  4  MPI  processes  per  node  and  16  threads  per  process  on  BG/Q  and  with  4  MPI 
processes  per  node  and  6  threads  per  process  on  Edison. 

Table  2  lists  profiling  data  for  a  CCSD  iteration  of  CTF  on  1024  nodes  (16K  cores)  of  BG/Q  and  256 
nodes  (6K  cores)  of  Edison  on  25  water  molecules  with  a  basis  set  size  of  600  virtual  orbitals.  This  problem 
took  228  seconds  on  1024  nodes  of  BG/Q  and  167  seconds  on  256  nodes  of  Edison.  The  table  reports  the 
percentage  of  execution  time  of  a  CCSD  iteration  spent  in  the  main  kernels  in  CTF.  The  table  also  lists  the 
architectural  bounds  for  each  kernel,  demonstrating  the  components  of  the  hardware  being  utilized  by  each 
computation.  The  time  inside  matrix  multiplication  reflects  the  time  spent  working  on  tensor  contractions 
sequentially  on  each  processor,  while  broadcasts  represent  the  time  spent  communicating  data  for  replica¬ 
tion  and  the  nested  SUMMA  algorithm.  The  prefix  sum,  data  packing,  and  all-to-all-v  operations  are  all 
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Table  3:  A  performance  breakdown  of  important  kernels  for  a  CCSDT  iteration  done  by  CTF  on  a  system 
with  8  water  molecules  (n0  =  40  occupied  orbitals)  and  a  cc-pVDZ  basis  set  (nv  =  192  virtual  orbitals)  on 
256  nodes  of  Edison  (XC30)  and  1024  nodes  of  Mira  (BG/Q).  Complexity  is  in  terms  of  p  processors  and 
M  memory  per  processor. 


kernel 

BG/Q 

XC30 

complexity 

architectural  bounds 

matrix  mult. 

20% 

16% 

0(n%ri*/p) 

flops/mem  bandwidth 

broadcasts 

16% 

11% 

0(n^v?0/p\fM) 

multicast  bandwidth 

prefix  sum 

5% 

3% 

O(p) 

allreduce  bandwidth 

data  packing 

9% 

9% 

0(n6vn60/p) 

integer  ops 

all-to-all-v 

19% 

20% 

0(n%n%/p) 

bisection  bandwidth 

tensor  folding 

26% 

39% 

0(ntno/p) 

memory  bandwidth 

other 

5% 

2% 

part  of  tensor  redistribution,  which  is  a  noticeable  overall  overhead.  Tensor  folding  corresponds  to  local 
transposition  of  each  tensor  block. 

Table  2  yields  the  somewhat  misleading  initial  observation  that  nearly  half  the  execution  is  spent  in  ma¬ 
trix  multiplication  on  BG/Q,  14%  more  than  than  the  fraction  spent  in  matrix  multiplication  on  Edison.  In 
fact,  we  observe  this  difference  for  two  reasons,  because  the  BG/Q  run  uses  four  times  the  number  of  pro¬ 
cessors  causing  there  to  be  more  redundant  computation  on  padding  in  the  decomposition,  and  because  the 
matrix  multiplication  library  being  employed  on  BG/Q  achieves  a  significantly  lower  fraction  of  peak  than 
on  Edison  for  the  invoked  problem  sizes.  Overall,  the  computational  efficiency  of  our  CCSD  implementation 
is  more  favorable  on  Edison  than  on  BG/Q. 

For  CCSDT,  Aquarius  and  CTF  maintain  a  lower  percentage  of  peak  than  for  CCSD,  but  achieves  good 
parallel  scalability.  The  lower  fraction  of  peak  is  expected  due  to  the  CCSDT  increased  relative  cost  and 
frequency  of  transpositions  and  summations  with  respect  to  CCSD.  In  particular  CCSD  performs  0(n3/2) 
computation  with  0(n )  data  while  CCSDT  performs  0(n4,/3)  computation  with  0{n)  data.  The  third-order 
excitation  tensor  T3  present  only  in  CCSDT  has  6  dimensions  and  its  manipulation  (packing,  unpacking, 
transposing,  and  folding)  becomes  an  overhead.  We  observe  this  trend  in  the  CCSDT  performance  profile 
on  Table  3,  where  tensor  transposition  and  redistribution  take  up  a  larger  fraction  of  the  total  time  than  in  the 
CCSD  computation  considered  in  Table  2.  The  8-water  CCSDT  problem  in  Table  3  took  15  minutes  on  1024 
nodes  of  BG/Q  and  21  minutes  on  256  nodes  of  Edison.  The  strong  scalability  achievable  for  this  problem 
is  significantly  better  on  Edison,  increasing  the  number  of  nodes  by  four,  BG/Q  performs  such  a  CCSDT 
iteration  (using  4096  nodes)  in  9  minutes  while  Edison  computes  it  (using  1024  nodes)  in  6  minutes. 

7  Future  work 

CTF  provides  an  efficient  distributed-memory  approach  to  dense  tensor  contractions.  The  infrastructure 
presented  in  the  paper  may  be  extended  conceptually  and  is  being  improved  in  practice  by  more  robust 
performance  modelling  and  mapping  logic.  Another  promising  direction  is  the  addition  of  inter-contraction 
parallelism,  allowing  for  smaller  contractions  to  be  scheduled  concurrently.  CTF  already  supports  working 
on  subsets  of  nodes  and  moving  data  between  multiple  instances  of  the  framework,  providing  the  necessary 
infrastructure  for  a  higher  level  scheduling  environment. 

From  a  broader  applications  standpoint  the  addition  of  sparsity  support  to  CTF  is  a  particularly  attractive 
future  direction.  Working  support  for  distributed  memory  tensor  objects  would  allow  for  the  expression  of 
a  wide  set  of  numerical  schemes  based  on  sparse  matrices.  Further  tensors  with  banded  sparsity  structure 
can  be  decomposed  cyclically  so  as  to  preserve  band  structure  in  the  same  way  CTF  preserves  symmetry. 
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Tensors  with  arbitrary  sparsity  can  also  be  decomposed  cyclically,  though  the  decomposition  may  need  to 
perform  load  balancing  in  the  mapping  and  execution  logic. 

Cyclops  Tensor  Framework  will  also  be  integrated  with  a  higher-level  tensor  expression  manipulation 
framework  as  well  as  CC  code  generation  methods.  We  have  shown  a  working  implementation  of  CCSD 
and  CCSDT  on  top  of  CTF,  but  aim  to  implement  more  complex  methods.  In  particular,  we  are  targeting 
the  CCSDTQ  method,  which  employs  tensors  of  dimension  up  to  8  and  gets  the  highest  accuracy  of  any 
desirable  CC  method  (excitations  past  quadruples  have  a  negligible  contribution). 
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